Abstract


Cloud observation and cloud modeling data can be presented in 
histograms for each characteristic to be measured. Combining 
information from single-cloud histograms yields a summary histogram. 

Summary histograms can be compared to each other to reach 
conclusions about the behavior of an ensemble of clouds in different 
places at different times or about the accuracy of a particular cloud 
model. 

As in any scientific comparison, it is necessary to decide whether any 
apparent differences are statistically significant. The usual methods of 
deciding statistical significance when comparing histograms do not 
apply in this case because they assume independent data. Thus, a new 
method is necessary. The proposed method uses the Euclidean distance
metric and bootstrapping to calculate the significance level. 

Introduction 

Cloud observation and cloud modeling create large amounts of data. The information from observa- 
tion and modeling data can be presented as a histogram for each characteristic or parameter of a given 
cloud. Combining information from several different clouds, whether real or modeled, requires combin- 
ing these histograms into a summary histogram. Summary histograms, which represent signals of a large 
data sample, can then be compared to each other to reach conclusions about the behavior of an ensemble 
of clouds in different places at different times or about the accuracy of a particular cloud model. 

As in any scientific comparison, it is necessary to decide whether any apparent differences are statisti- 
cally significant. The usual methods of deciding statistical significance when comparing histograms do 
not apply in this case because they assume the data are independent; thus, a new method is necessary. In 
this study, the proposed method is to choose a distance metric and use bootstrapping to calculate the 
significance level. Details of this method are described in this report. 

Satellite Data 

Observations of a cloud, whether from a computer model or from satellite remote sensing, consist of
measurements made on a grid. The grid points are referred to as “footprints” in satellite remote sensing
and as grid boxes in cloud modeling. Several quantities are measured or inferred at each footprint of
every cloud: solar insolation, shortwave reflected radiation, albedo, cloud optical depth, ice water
path, cloud ice diameter, liquid water path, cloud droplet radius, outgoing longwave radiation,
emissivity, cloud top temperature, cloud top height, cloud top pressure, and sea surface temperature.
Because the sizes of the satellite-observed clouds vary greatly, the number of footprints measured in a
cloud can be fairly small or quite large. The data set used here consists of measurements made by the
Clouds and the Earth’s Radiant Energy System (CERES; Wielicki et al., 1996) instrument on the Tropical
Rainfall Measuring Mission (TRMM) satellite during March 1998. It contains 352 clouds varying in size
from 74 footprints to 6883 footprints, with a mean of 545 footprints. Each footprint is typically
10-15 km in diameter.



The CERES instrument observes the various quantities at each footprint, and histograms (recorded in a 
separate file for each cloud) summarize these values. The coded file names contain information about the 
date and location of the cloud. Because these “single-cloud” histograms contain data that were measured 
in the same cloud system, the data may not be independent. For example, if one footprint measures the 
cloud height at 15 km, it is likely that nearby footprints also have similar values for this measurement. 
Thus, a histogram reporting cloud top height that has a large number of observations in the bin centered at 
15.25 km is more likely to have observations in the neighboring bins centered at 14.75 km and 15.75 km, 
provided that the chosen bin size is 0.5 km. 

These single-cloud histograms are typically not studied alone but are combined with histograms from 
other clouds with certain common attributes. It is these summary histograms that are compared with each 
other. For example, 100 clouds from one geographic region could be combined into a histogram, which 
would be compared to a similar histogram summarizing 150 clouds from another geographic region. 
Model simulations of clouds could be compared to satellite observations of the same clouds, or clouds 
from one month/year could be compared to clouds from another month/year in the same geographic 
region. 
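
As a minimal sketch of how single-cloud histograms might be combined, the Python fragment below pools the footprint counts bin by bin and converts the totals to proportions. The function name `summary_histogram` and the counts are hypothetical, and the sketch assumes that every cloud is binned on the same grid.

```python
import numpy as np

def summary_histogram(cloud_counts):
    """Pool per-cloud bin counts and return the proportion in each bin.

    cloud_counts has one row per cloud and one column per bin; all clouds
    are assumed to share the same bin edges.
    """
    counts = np.asarray(cloud_counts, dtype=float)
    total = counts.sum(axis=0)        # add the footprints bin by bin
    return total / total.sum()        # proportions sum to one

# Hypothetical footprint counts for three clouds on a four-bin grid.
region = [[10, 30, 5, 0],
          [0, 120, 60, 20],
          [2, 8, 4, 1]]
summary = summary_histogram(region)
```

Because counts are pooled before normalizing, a cloud with many footprints contributes more to the summary histogram than a small cloud does.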

While the satellite footprint observations within each cloud are dependent, this study assumes that dif- 
ferent clouds are independent of each other. An argument could be made that the clouds are dependent 
based on similar weather dynamics over a large geographic region where many clouds are developed and 
maintained, for example, but any dependency involved would be difficult to quantify. It should be noted 
that if the bootstrapping procedure described subsequently is used to analyze data in which the different 
clouds are clearly dependent, it will fail to yield meaningful results. 

Measuring Differences in Histograms 

Several methods for evaluating the differences between two summary histograms were examined.
They included standard goodness of fit tests from the statistical literature as well as several distance
measures. The goodness of fit tests require independence assumptions and, as mentioned previously, the
individual cloud histograms lack such independence. It is possible that a Chi-squared goodness of fit test
could be modified to apply, but the exact nature of the modifications is a topic for future study. Thus, it
was decided to use a measure of distance between histograms as a statistic.

Several distance measures were examined. They were chosen because they had been used with good
results in other applications that require comparison of histograms, for example, image retrieval, remote
sensing, or object tracking. The characteristics of each measure were examined by comparing its
behavior for simple test histograms similar to those in figure 1. Through this process, the candidates
were narrowed to two: the L2 measure and the Jeffries-Matusita (JM) distance (also called the
Hellinger distance).

The L2 measure is the typical Euclidean distance between two vectors, which is defined by

$$L_2 = \sqrt{\sum_i \left(a_i - b_i\right)^2}$$

where $a_i$ and $b_i$ are the proportions of values in the $i$th bin of the respective histograms. If the
histogram is reported as an approximate probability distribution function, in which the area contained by
the histogram is one, this proportion (either $a_i$ or $b_i$) is found by multiplying the bin height by the
bin width. It is possible to define $a_i$ and $b_i$ as the bin heights without multiplying by the width.
Such a change would affect the value of L2 but would not change the relative scale of smaller L2 values
to larger values. However, if the area contained by the histogram is one, the maximum value that L2 or JM
(defined subsequently) can achieve is $\sqrt{2}$. If desired, these values can be divided by $\sqrt{2}$ to
ensure that the maximum value is one.

The Jeffries-Matusita distance is defined by

$$JM = \sqrt{\sum_i \left(\sqrt{a_i} - \sqrt{b_i}\right)^2}$$

It has been used in applications, such as image retrieval, in which it is necessary to find small differences 
in data. Note that both these formulas assume that the histograms have bins with equal widths. 
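
The two distances can be sketched in a few lines of Python. The function names are hypothetical, and the sketch assumes the standard forms of the Euclidean and Jeffries-Matusita distances applied to bin proportions that sum to one on a shared set of equal-width bins.

```python
import numpy as np

def l2_distance(a, b):
    """Euclidean distance between two vectors of bin proportions."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.sqrt(np.sum((a - b) ** 2)))

def jm_distance(a, b):
    """Jeffries-Matusita distance: Euclidean distance of the square roots."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.sqrt(np.sum((np.sqrt(a) - np.sqrt(b)) ** 2)))

# Two histograms with no overlap at all: the worst case for both measures.
a = [1.0, 0.0]
b = [0.0, 1.0]
worst_l2 = l2_distance(a, b)
worst_jm = jm_distance(a, b)
```

For these non-overlapping histograms both distances equal the square root of two, which is the maximum noted in the text, and both are zero when a histogram is compared with itself.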

Figure 1 demonstrates some of the differences in the behavior of these two distances. According to
the L2 distance, the second and third histograms are equally distant from the first histogram. However,
when one uses the Jeffries-Matusita distance, the third histogram is farther away from the first than the
second histogram is. This extra distance demonstrates that in the Jeffries-Matusita distance, differences in
small bins are more important than differences in large bins, while in the L2 distance, it is simply the
magnitude of the difference that matters.

Figure 1. Behavior of L2 distance versus Jeffries-Matusita distance for idealized histograms.
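
The behavior described for figure 1 can be reproduced with a small numerical sketch. The three-bin histograms below are hypothetical stand-ins, not the histograms of figure 1: the second moves 0.1 of the mass between two well-populated bins, while the third moves the same amount into a bin that was empty.

```python
import numpy as np

def l2_distance(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.sqrt(np.sum((a - b) ** 2)))

def jm_distance(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.sqrt(np.sum((np.sqrt(a) - np.sqrt(b)) ** 2)))

first = [0.8, 0.2, 0.0]    # reference histogram (hypothetical)
second = [0.7, 0.3, 0.0]   # 0.1 shifted between two populated bins
third = [0.7, 0.2, 0.1]    # 0.1 shifted into a previously empty bin

l2_second, l2_third = l2_distance(first, second), l2_distance(first, third)
jm_second, jm_third = jm_distance(first, second), jm_distance(first, third)
```

The two L2 values are identical because only the magnitudes of the bin differences enter, whereas the JM value for the third histogram is larger because the square root magnifies changes in small bins.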

Bootstrapping Procedure 

The formulas discussed previously define distances between histograms, but they do not address the
question of statistical significance; that is, does the difference in the summary histograms imply that they
came from two different populations? Bootstrapping addresses this question. A short introduction
to bootstrapping is included in appendix A, and a copy of a computer program that implements the
algorithm discussed in this section is given in appendix B. It is important to note that bootstrapping
can yield incorrect results. For example, it is possible to conclude that there is a difference
in the underlying populations when none exists. However, the same is true of any statistical
procedure.

The CERES/TRMM data from March 1998 were saved in 352 files. Each file contains histograms of
data about a particular cloud and is named in a manner that describes the date and location of that cloud.
Each cloud was classified as being in the Eastern, Western, or Central Tropical Pacific
region. In the algorithm described subsequently, Region 1 denotes one of the three regions, while
Region 2 denotes one of the remaining regions. The process for comparing Region 1 to Region 2 clouds
is as follows:

1. Make lists of file names for Region 1 and 2 clouds. In this case, there are 88 clouds classified as 
being in Region 1 and 135 classified as in Region 2. 

2. Merge the lists of file names for Regions 1 and 2. There are 223 clouds on the new list. Under the
null hypothesis, these clouds come from the same population; thus, the assignment of clouds to
Region 1 and Region 2 is equivalent to a random choice from the merged list.

3. Choose 88 file names to represent the “Random 1” contingent of clouds by randomly sampling
with replacement from the list of 223 file names. Do the same to choose 135 clouds for the
“Random 2” contingent of clouds.

4. Create summary histograms for the new sets of “Random 1” and “Random 2” clouds and calculate 
the values of the distance measures. 

5. Compare the bootstrapped distance value between the “Random 1” and “Random 2” clouds to the 
distance value calculated for the true arrangement of the clouds. If the new value is larger than the 
true value, add one to a counter. Repeat steps 3-5 for a total of 5000 iterations. 

6. Divide the value of the counter by 5000. If this proportion is small, less than 5 percent for a 
95 percent confidence level, we have evidence that there is a difference between the cloud 
populations. If desired, the bootstrapped values can be stored and graphed to allow visualization of 
the true value location compared to the bootstrapped values. 

This algorithm tests the null hypothesis that all the clouds in the files are from the same population. If 
this hypothesis is true, then the clouds in Region 1 and Region 2 are essentially equivalent. Therefore, the 
distance between the histograms for the “true” ordering is essentially a random number picked from the 
sampling distribution of the bootstrapped distances. 

In the algorithm, the proportion of bootstrapped distances that are greater than the “true” distance is
calculated. If the true distance is a random choice, as implied by the null hypothesis, this proportion
could be any value between zero and one. A very small value for this proportion is evidence that the true
distance was not a random choice from the sampling distribution. The proportion will be called the
approximated significance level (ASL), and a value less than 0.05 will be evidence against the null
hypothesis.
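
The six steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the program of appendix B: each cloud is represented by a row of bin counts rather than a file, the names `summary`, `l2`, and `bootstrap_asl` are hypothetical, and the demonstration data are synthetic.

```python
import numpy as np

def summary(counts):
    """Pool per-cloud bin counts into proportions (one row per cloud)."""
    total = counts.sum(axis=0)
    return total / total.sum()

def l2(a, b):
    return float(np.sqrt(np.sum((a - b) ** 2)))

def bootstrap_asl(region1, region2, distance=l2, n_iter=5000, seed=0):
    """Return the true distance and its approximated significance level."""
    rng = np.random.default_rng(seed)
    region1 = np.asarray(region1, dtype=float)
    region2 = np.asarray(region2, dtype=float)
    true_dist = distance(summary(region1), summary(region2))
    merged = np.vstack([region1, region2])              # step 2: merge lists
    n1, n2 = len(region1), len(region2)
    exceed = 0
    for _ in range(n_iter):
        # Step 3: resample whole clouds, with replacement, from the merged list.
        random1 = merged[rng.integers(0, len(merged), n1)]
        random2 = merged[rng.integers(0, len(merged), n2)]
        # Steps 4-5: distance between the resampled summary histograms.
        if distance(summary(random1), summary(random2)) > true_dist:
            exceed += 1
    return true_dist, exceed / n_iter                   # step 6: the ASL

# Synthetic clouds (200 footprints each) drawn from known bin probabilities.
demo = np.random.default_rng(1)
clouds_a = demo.multinomial(200, [0.2, 0.5, 0.3], size=20)
clouds_b = demo.multinomial(200, [0.2, 0.5, 0.3], size=30)
clouds_c = demo.multinomial(200, [0.6, 0.3, 0.1], size=30)
d_same, asl_same = bootstrap_asl(clouds_a, clouds_b, n_iter=500)
d_diff, asl_diff = bootstrap_asl(clouds_a, clouds_c, n_iter=500)
```

Resampling whole clouds rather than individual footprints keeps the dependent footprints of a cloud together, which is why the procedure needs only the assumption that different clouds are independent.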
Results


The algorithm previously described was implemented by using the March 1998 CERES/TRMM data.
Details of this data set are given at <http://cloud-object.larc.nasa.gov>. They are not described here
because this report focuses on the methodology for comparing summary histograms and testing the
statistical significance of their differences. Results presented in this section compare clouds in the
Eastern Pacific to clouds in the Western Pacific. Similar comparisons between the Western and Central
Pacific clouds and between the Eastern and Central Pacific clouds are not presented.


Figure 2 shows the histograms (left panel), measured distances, and ASLs (right panel) of emissivity
in the Eastern and Western Pacific. Emissivity is a measure of how strongly a body radiates energy and
has a value between zero and one (Wallace and Hobbs, 1977). The histograms for the two locations
overlap to such an extent that it is difficult even to see the line representing the Eastern Pacific. The
measured distances agree, yielding very small values of L2 and Jeffries-Matusita. The corresponding
ASLs are quite large, so this is a situation in which the data give no evidence of a difference between
the populations in the distribution of emissivity.

Figure 2. Histograms (left panel), measured distances, and the approximated significance levels (ASLs) (right
panel) of emissivity for Eastern and Western Pacific clouds during the March 1998 period.


Figure 3 shows the histograms of sea surface temperature (SST) associated with the clouds in the
Eastern and Western Pacific in March 1998. The histograms have similar shapes and are very
concentrated about their modes; however, the mode for the Eastern Pacific clouds is 302.25 K, while the
mode for the Western Pacific clouds is 302.75 K. Also, the range of SSTs for the Eastern Pacific clouds is
298.25-303.75 K, while the range for the Western Pacific clouds is much larger, 292.25-304.75 K.
This difference is reflected in fairly large values for L2 and Jeffries-Matusita but is shown much more
starkly in the extremely low ASL values. Thus, we can conclude that there is a difference in SSTs
between the clouds in the two regions.

Figures 4 and 5 show the histograms and numeric information for cloud optical depth and cloud ice
diameter, respectively. Optical depth measures the depletion of a beam of radiation as a result of passing
through a cloud layer (Wallace and Hobbs, 1977). The L2 values for these graphs are both 0.027.


Figure 3, right panel values: L2 = 0.340, ASL for L2 = 0.001; JM = 0.488, ASL for JM = 0.0002.

Figure 3. Histograms (left panel), measured distances, and ASLs (right panel) of sea surface temperature (SST) for
Eastern and Western Pacific clouds during the March 1998 period.

Figure 4. Histograms (left panel), measured distances, and ASLs (right panel) of cloud optical depth for Eastern and
Western Pacific clouds during the March 1998 period.

Figure 5. Histograms (left panel), measured distances, and ASLs (right panel) of cloud ice diameter for Eastern and
Western Pacific clouds during the March 1998 period.


However, the corresponding ASLs are not similar to each other. The ASL for cloud optical depth is
0.196, while the ASL for ice diameter is 0.700. This difference is a result of differences in the amount of
variation in the original data and demonstrates that the significance level does not depend on the numeric
value of the corresponding statistic, but on the size of the statistic relative to the bootstrapped
values. The ASL for the JM distance is likewise much higher for ice diameter than for cloud optical depth
(by more than 100 percent), although the JM distances for these two quantities differ by less than
50 percent. This further supports the assertion that the ASL value is not directly related to the statistic.

Figures 6-10 give further examples of histograms with their corresponding L2 and Jeffries-Matusita
values and the ASLs yielded by the bootstrapping algorithm. In each of these cases, the statistical
evidence does not imply a difference in the underlying populations, although visual inspection may
suggest otherwise in some of them, because the eye tends to take the area between the two curves as the
distance. There are two explanations for this apparent disagreement. One is that the data samples are not
large enough. The sample sizes are extremely small for liquid water path and cloud droplet radius
because the cloud tops of most of the cloud footprints are too high to contain liquid-phase clouds. The
other is that the L2 and JM distances are not equivalent to the area between the two probability density
functions (PDFs), which is what visual inspection perceives.

The histograms of top-of-the-atmosphere (TOA) solar insolation shown in figure 11 are an example of
disagreement between the L2 statistic and the Jeffries-Matusita statistic. In this case, the ASL generated
with the L2 statistic implies that there is no difference between the underlying populations. However, the
ASL generated with the Jeffries-Matusita statistic does imply a significant difference, which is probably
related to the large JM value (0.519) and to the nearly single-point PDFs, with rather different modes, of
TOA solar insolation for individual clouds. The randomized populations may produce small JM values when
PDFs with similar modes are combined, which results in fewer small bins than in the true distribution of
cloud populations.

Figures 12-14 also demonstrate disagreement between the L2 and JM distances. However, in these
cases it is the significance level generated by the L2 statistic that suggests that there is a difference in the
populations. The significance levels generated by the Jeffries-Matusita statistic do not support the
conclusion that the populations are different. Figure 15 is included in this group because the approximated
significance level for the L2 statistic in this case is only slightly larger than 0.05. Technically, an ASL
that is larger than 0.05 does not support the hypothesis that the two populations are different.

Figure 6. Histograms (left panel), measured distances, and ASLs (right panel) of shortwave reflected radiation flux
for Eastern and Western Pacific clouds during the March 1998 period.

Figure 7. Histograms (left panel), measured distances, and ASLs (right panel) of ice water path (IWP) for
Eastern and Western Pacific clouds during the March 1998 period.

Figure 8. Histograms (left panel), measured distances, and ASLs (right panel) of cloud top pressure for Eastern and
Western Pacific clouds during the March 1998 period.

Figure 9. Histograms (left panel), measured distances, and ASLs (right panel) of liquid water path for Eastern and
Western Pacific clouds during the March 1998 period.

Figure 10. Histograms (left panel), measured distances, and ASLs (right panel) of cloud droplet radius for Eastern
and Western Pacific clouds during the March 1998 period.

Figure 11. Histograms (left panel), measured distances, and ASLs (right panel) of top-of-the-atmosphere (TOA)
solar insolation for Eastern and Western Pacific clouds during the March 1998 period.

Figure 12. Histograms (left panel), measured distances, and ASLs (right panel) of TOA albedo for Eastern and
Western Pacific clouds during the March 1998 period.

Figure 13. Histograms (left panel), measured distances, and ASLs (right panel) of outgoing longwave radiation for
Eastern and Western Pacific clouds during the March 1998 period.

Figure 14. Histograms (left panel), measured distances, and ASLs (right panel) of cloud top temperature for Eastern
and Western Pacific clouds during the March 1998 period.

Figure 15. Histograms (left panel), measured distances, and ASLs (right panel) of cloud top height for Eastern and
Western Pacific clouds during the March 1998 period.

The final choice of which statistic to use should be based on how each behaves in distinguishing
histograms. Figures 11-15 help in this endeavor because they are examples in which the two statistics
yield different conclusions. The choice of which is more accurate should depend on the nature of the
properties being measured. The difference between the Eastern and Western Pacific solar insolation data
was determined to be significant by the Jeffries-Matusita distance, but not by the L2 distance. Solar
insolation measures the amount of energy coming from the Sun. It changes with time of year and
latitude but not with longitude (Wallace and Hobbs, 1977), so a significant east-west difference is not
physically expected. Conversely, the difference between the Eastern and Western Pacific data for albedo,
outgoing longwave radiation, cloud top temperature, and cloud top height was determined to be significant
by the L2 distance, but not by the Jeffries-Matusita distance. It would be reasonable for these quantities
to differ based on location because of the climatological contrast between the two regions. During the
El Niño period, the occurrence of cloud systems shifted from the Western Pacific to the Central and
Eastern Pacific (Cess et al., 2001), but certain properties can still be different in the two regions. Based
on these observations, it is recommended that the L2 distance be used rather than the Jeffries-Matusita
distance.

Concluding Remarks 

Information about individual clouds is stored in histograms that can be combined to create summary 
histograms. These summary histograms can then be compared to summary histograms from other places, 
other times, other observational methods, or model simulations. It is necessary, therefore, to design a 
method to determine whether any differences in these histograms are statistically significant. 

One can determine statistical significance by choosing a statistic and then calculating the approxi- 
mated significance level by using a bootstrapping procedure. It is important to note that statistical signifi- 
cance is determined by comparing histograms to other histograms that are generated by the same type of 
data. That is, comparing the distances between histograms for emissivity to the distances between histo- 
grams for solar insolation does not yield any important information. Instead, the approximated signifi- 
cance levels can be compared, if desired. 

Comparing the behavior of two distance measures under the bootstrapping procedure leads to the
recommendation that the L2 statistic (the typical Euclidean distance between two vectors) be used rather
than the Jeffries-Matusita distance. The graphs in which the two measures differ were examined, and the
Jeffries-Matusita distance suggested that differences in solar insolation between Eastern Pacific and
Western Pacific clouds were significant. The L2 measure did not agree with this conclusion. The
properties of solar insolation argue against any difference in longitude having an effect on the value.
Other graphs in which the L2 measure suggested significance and the Jeffries-Matusita measure did not
were examined. In these cases, it was reasonable to assume that longitude could affect the values of the
quantities being measured.

Therefore, it is recommended that the L2 statistic be combined with a bootstrapping procedure to
compare summary histograms to each other. Once the bootstrapping procedure has been applied, the
approximated significance level that is generated can be compared to a desired significance level, for
example 0.05, to reach a conclusion about whether the underlying populations differ.

Appendix A

An Introduction to Bootstrapping 


Most statistical procedures, such as finding confidence intervals or performing hypothesis tests, rely on
knowing the distribution of the test statistic. For example, if we have collected numerical data that are
normally distributed, we can show that the sample mean is normally distributed and the sample
variance, properly normalized, is distributed like a Chi-squared random variable. These facts are
responsible for several standard statistics formulas.

What if the distribution of the statistic is unknown? This lack of information could result if the 
distribution of the data is unknown or if the formula for calculating the statistic is complicated. Often, a 
distribution for the data is assumed anyway to simplify further work. This assumption, however, invites 
criticism of the final results and does not address complicated calculations. 

If we know the distribution of the data, we can approximate the distribution of the statistic by simu- 
lating the experiment many times and recording the value of the statistic for each trial, and the set of 
values for the statistic can be used to estimate any required quantity. This method simplifies the analysis 
for complicated calculations but does not address the question of whether we know the distribution of the 
data. 

Bootstrapping is based on the fact that the empirical distribution function — that is, a function that puts 
equal weight on each of the sample data points — forms an estimate of the distribution function of the data. 
To approximate the distribution of the test statistic, we will sample randomly with replacement from the 
data, calculate the value of the statistic, and repeat. 

An example may clarify the process. Suppose the following 15 numbers were gathered during an 
experiment: 


0.726   0.712   0.401   0.892   0.621
0.902   0.556   0.020   0.612   0.346
0.819   0.539   0.330   0.950   0.687


The sample mean calculated from these data is $\bar{X} = 0.6075$. What is a 95-percent confidence interval
for the mean? If we can assume that the data are normally distributed, we can use the standard formula
$\bar{X} \pm 1.96\, s/\sqrt{n}$. This formula arises because the standard error (the standard deviation of
the statistic) is $s/\sqrt{n}$, and 1.96 standard deviations to the left and right of the center enclose
95 percent of the area of a normal curve. However, we have no way of verifying the normality assumption,
and graphing these data in a histogram yields a graph that does not look like the familiar bell-shaped
curve. Instead, we will approximate the distribution of $\bar{X}$ and take the numbers that enclose
95 percent of the area.

To approximate the distribution of $\bar{X}$, first sample with replacement from the previous 15 numbers.
For example, the following numbers could result:


0.902   0.950   0.539   0.612   0.621
0.621   0.712   0.330   0.712   0.950
0.621   0.401   0.020   0.556   0.687

Notice that while some of the numbers are repeated, there are no numbers that were not in the original
sample. The new sample mean is $\bar{X}_1 = 0.6156$. Resampling 1000 times yields a set of sample means,
which are plotted in the following histogram (see fig. A1):



Figure A1. Histogram of sample means.


Sorting the values $\bar{X}_1, \ldots, \bar{X}_{1000}$ allows us to see that 95 percent of the values are in
the interval (0.4773, 0.7227), which then serves as a confidence interval for $\mu$. (The reader may notice
that the distribution of $\bar{X}$ looks like a normal distribution. Figure A1 is a demonstration of the
Central Limit Theorem, and this resemblance will not necessarily hold for other test statistics.)
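
The confidence-interval example can be sketched in Python with the 15 values listed above. The random seed and the use of `np.percentile` to read off the central 95 percent are choices made for this illustration, so the endpoints will be close to, but not identical with, the interval quoted in the text.

```python
import numpy as np

data = np.array([0.726, 0.712, 0.401, 0.892, 0.621,
                 0.902, 0.556, 0.020, 0.612, 0.346,
                 0.819, 0.539, 0.330, 0.950, 0.687])

rng = np.random.default_rng(0)   # arbitrary seed, for repeatability only
n_resamples = 1000

# Each resample draws 15 values with replacement and records their mean.
means = np.array([rng.choice(data, size=data.size, replace=True).mean()
                  for _ in range(n_resamples)])

# The 2.5th and 97.5th percentiles enclose 95 percent of the resampled means.
lower, upper = np.percentile(means, [2.5, 97.5])
```

Replacing `.mean()` with another statistic gives an interval for that statistic without requiring a formula for its standard error.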

To demonstrate the use of bootstrapping in hypothesis testing, suppose we have a second set of 15 data
values, with sample mean $\bar{Y} = 0.3435$, in addition to the first set, and the goal is to determine
whether the two sets came from the same distribution.

The null hypothesis is that these two sets of data are from the same distribution, and the alternate
hypothesis is that they are not. For the sake of argument, suppose we have chosen $\bar{X} - \bar{Y}$ as a
test statistic. Then, for these data, $\bar{X} - \bar{Y} = 0.6075 - 0.3435 = 0.2640$. Does this example show
a statistically significant difference?

Under the null hypothesis, the two samples are from the same distribution, so merging the samples
yields an approximation of the underlying distribution. First, choose two sets of 15 numbers each by
sampling with replacement from the entire set of 30 numbers. For one such choice, the difference in
sample means is $\bar{X}_1 - \bar{Y}_1 = -0.0750$. Repeating this exercise 1000 times yields values
$\bar{X}_1 - \bar{Y}_1, \ldots, \bar{X}_{1000} - \bar{Y}_{1000}$, which are collected in figure A2.

Figure A2. Histogram of differences (horizontal axis: difference in sample means).


There are 7 values greater than 0.2640 and 13 values less than -0.2640, so p = 20/1000 = 0.02.
Assuming a significance level $\alpha = 0.05$, the fact that the p-value is smaller than $\alpha$ implies
that the difference is statistically significant. The null hypothesis that the two sets of data came from
the same distribution is rejected. In effect, if the two sets of data had come from the same population,
the original number we obtained would likely be close to the center of this distribution. Since it is in
one of the tails of the distribution, we have evidence that the assumption that led to the creation of this
distribution must be incorrect.
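
A two-sample version of the procedure can be sketched in the same way. The function name `bootstrap_mean_test` is hypothetical, and because the second data set is not listed here, the sketch substitutes a made-up second sample (the first sample scaled by 0.565, which gives a similar difference in means). The resulting p-value therefore illustrates the mechanics only and is not the value reported above.

```python
import numpy as np

def bootstrap_mean_test(x, y, n_resamples=1000, seed=0):
    """Two-sided bootstrap test of equal distributions using the mean gap."""
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    observed = x.mean() - y.mean()
    merged = np.concatenate([x, y])       # null hypothesis: one population
    diffs = np.empty(n_resamples)
    for k in range(n_resamples):
        xs = rng.choice(merged, size=x.size, replace=True)
        ys = rng.choice(merged, size=y.size, replace=True)
        diffs[k] = xs.mean() - ys.mean()
    # Count resampled differences beyond the observed one in either tail.
    p_value = float(np.mean(np.abs(diffs) > abs(observed)))
    return float(observed), p_value

x = np.array([0.726, 0.712, 0.401, 0.892, 0.621,
              0.902, 0.556, 0.020, 0.612, 0.346,
              0.819, 0.539, 0.330, 0.950, 0.687])
y = 0.565 * x                             # hypothetical second sample
observed, p_value = bootstrap_mean_test(x, y)
_, p_same = bootstrap_mean_test(x, x)     # identical samples, for contrast
```

Counting both tails matches the worked example, in which values above 0.2640 and below -0.2640 are added together.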
Appendix B
Program Code 


This appendix is a copy of a FORTRAN 77 program that implements the bootstrapping algorithm
described in the text. The program assumes that it will be run in the same folder or directory as the files
containing formatted information about clouds. It also assumes that the names of these files contain a
number as the 19th character, which describes whether the cloud is from the Eastern, Central, or Western
Pacific. Finally, it assumes that a file named 'filenames.txt' exists that lists the file names in the
directory. Such a file can easily be created in DOS with the command dir /b S* > filenames.txt, or
in UNIX with the command ls S* > filenames.txt.

This program took about three hours to run. Most likely, the bulk of the time is used to open and close 
files. If so, the run time can be reduced by storing the values in the files at the beginning of the program 
run and then referring to the stored values rather than rereading the files. 

      program eastwest

C ***** variable declarations *****

      parameter (numfiles=352, maxbin=400, numiter=5000)
      character*24 tempname, east(numfiles),
     +             west(numfiles)
      integer eastindex, westindex,
     +        eastlength, westlength,
     +        easti(numfiles), westi(numfiles)
      real eastloc(maxbin,14), aal, bbl, binsize(14),
     +     eastcount(maxbin,14), eastbintotal(14),
     +     westloc(maxbin,14), westbintotal(14),
     +     westcount(maxbin,14),
     +     L2(14), JM(14), L2boot(14),
     +     JMboot(14), pL(14), pJ(14)

      ivar=1
      eastindex=1
      westindex=1

C *****read in list of filenames *****
C *****requires file containing a list of all filenames*****
C *****also requires the 19th character in the filename to *****
C *****be a number, 1,2, or 3, which signifies east, west, or central****

      open (unit=20, file='filenames.txt',
     +      form='formatted', access='sequential')

   50 read (20,*,end=100) tempname
      if (tempname(19:19) .eq. '1') then
         east(eastindex)=tempname
         eastindex=eastindex+1
      elseif (tempname(19:19) .eq. '2') then
         west(westindex)=tempname
         westindex=westindex+1
      elseif (tempname(19:19) .eq. '3') then
C        central(centralindex)=tempname
C        centralindex=centralindex+1
      else
         print *, 'error!'
         go to 100
      endif

      ivar=ivar+1
      go to 50

  100 continue
      close (20)
      length=ivar-1
      eastlength=eastindex-1
      westlength=westindex-1
C     print *, length, eastlength, westlength

C *****EAST*****
C     print *, 'East'

C *****Initialize variables*****

      do ivar=1,14
         do n=1,maxbin
            eastcount(n,ivar)=0.0
         enddo
         eastbintotal(ivar)=0
      enddo

C ***** Read in data from files *****

      do jvar=1,eastlength
         open (unit=21, file=east(jvar),
     +         form='formatted', access='sequential')
         nvar=1

 125     read(21, *, end=150) nbin, aal, bbl, lbinl
C ***** Note, line below should be if (aal .ne. -999.00), but I got error
C ***** messages for using not equal with a real variable.
         if (aal .le. -999.01 .or. aal .ge. -998.99) then
            eastloc(nbin,nvar) = aal
            eastcount(nbin,nvar) = eastcount(nbin,nvar) +
     +                             float(lbinl)
         else
            eastbintotal(nvar) = eastbintotal(nvar) + lbinl
            binsize(nvar) = bbl
            nvar=nvar+1
         endif
         go to 125
 150     continue
         close (21)
      enddo


C *****west*****
C     print *, 'West'

C *****Initialize variables*****

      do ivar=1,14
         do n=1,maxbin
            westcount(n,ivar)=0.0
         enddo
         westbintotal(ivar)=0
      enddo

C ***** Read in data from files *****

      do jvar=1,westlength
         open (unit=21, file=west(jvar),
     +         form='formatted', access='sequential')
         nvar=1

 225     read(21, *, end=250) nbin, aal, bbl, lbinl
C ***** Note, line below should be if (aal .ne. -999.00), but I got error
C ***** messages for using not equal with a real variable.
         if (aal .le. -999.01 .or. aal .ge. -998.99) then
            westloc(nbin,nvar) = aal
            westcount(nbin,nvar) = westcount(nbin,nvar) +
     +                             float(lbinl)
         else
            westbintotal(nvar) = westbintotal(nvar) + lbinl
            binsize(nvar) = bbl
            nvar=nvar+1
         endif
         go to 225
 250     continue
         close (21)
      enddo


C *****Calculate metrics*****

      do ivar=1,14
         L2(ivar)=0.0
         JM(ivar)=0.0
         do n=1,maxbin
            L2(ivar)=L2(ivar)+
     +         (eastcount(n,ivar)/eastbintotal(ivar)-
     +         westcount(n,ivar)/westbintotal(ivar))**2
            JM(ivar)=JM(ivar)+
     +         (sqrt(eastcount(n,ivar)/eastbintotal(ivar))-
     +         sqrt(westcount(n,ivar)/westbintotal(ivar)))**2
         enddo
         L2(ivar)=sqrt(L2(ivar))
         JM(ivar)=sqrt(JM(ivar))
C        write (6,*) ivar, ' L2: ', L2(ivar), ' JM: ', JM(ivar)
C
         write (6,*) '-999.00', binsize(ivar)
      enddo

C *****Now to do the random number set-up *****

      call date_time_seed@
      print *, 'random'

      do ivar=1,14
         pL(ivar)=0
         pJ(ivar)=0
      enddo

      do kvar=1,numiter
C        print *, kvar
         do jvar=1,eastlength
            easti(jvar)=nint(random()*(eastlength+westlength)+0.5)
C           write (6,*) easti(jvar)
         enddo
         do jvar=1,westlength
            westi(jvar)=nint(random()*(eastlength+westlength)+0.5)
C           write (6,*) westi(jvar)
         enddo

C ***** "EAST" *****
C        print *, 'East'

C *****Initialize variables*****

         do ivar=1,14
            do n=1,maxbin
               eastcount(n,ivar)=0.0
            enddo
            eastbintotal(ivar)=0
         enddo

C ***** Read in data from files *****

         do jvar=1,eastlength
C           write(6,*) easti(jvar), east(easti(jvar))
            if (easti(jvar) .le. eastlength) then
               open (unit=21, file=east(easti(jvar)),
     +               form='formatted', access='sequential')
            else
               open (unit=21, file=west(easti(jvar)-eastlength),
     +               form='formatted', access='sequential')
            endif

            nvar=1
 425        read(21, *, end=450) nbin, aal, bbl, lbinl
C ***** Note, line below should be if (aal .ne. -999.00), but I got error
C ***** messages for using not equal with a real variable.
            if (aal .le. -999.01 .or. aal .ge. -998.99) then
               eastloc(nbin,nvar) = aal
               eastcount(nbin,nvar) = eastcount(nbin,nvar) +
     +                                float(lbinl)
            else
               eastbintotal(nvar) = eastbintotal(nvar) + lbinl
               binsize(nvar) = bbl
               nvar=nvar+1
            endif
            go to 425
 450        continue
            close (21)
         enddo


C ***** "WEST" *****
C        print *, 'West'

C *****Initialize variables*****

         do ivar=1,14
            do n=1,maxbin
               westcount(n,ivar)=0.0
            enddo
            westbintotal(ivar)=0
         enddo

C ***** Read in data from files *****

         do jvar=1,westlength
            if (westi(jvar) .le. eastlength) then
               open (unit=21, file=east(westi(jvar)),
     +               form='formatted', access='sequential')
            else
               open (unit=21, file=west(westi(jvar)-eastlength),
     +               form='formatted', access='sequential')
            endif

            nvar=1
 525        read(21, *, end=550) nbin, aal, bbl, lbinl
C ***** Note, line below should be if (aal .ne. -999.00), but I got error
C ***** messages for using not equal with a real variable.
            if (aal .le. -999.01 .or. aal .ge. -998.99) then
               westloc(nbin,nvar) = aal
               westcount(nbin,nvar) = westcount(nbin,nvar) +
     +                                float(lbinl)
            else
               westbintotal(nvar) = westbintotal(nvar) + lbinl
               binsize(nvar) = bbl
               nvar=nvar+1
            endif
            go to 525
 550        continue
            close (21)
         enddo


C *****Calculate metrics*****

         do ivar=1,14
            L2boot(ivar)=0.0
            JMboot(ivar)=0.0
            do n=1,maxbin
               L2boot(ivar)=L2boot(ivar)+
     +            (eastcount(n,ivar)/eastbintotal(ivar)-
     +            westcount(n,ivar)/westbintotal(ivar))**2
               JMboot(ivar)=JMboot(ivar)+
     +            (sqrt(eastcount(n,ivar)/eastbintotal(ivar))-
     +            sqrt(westcount(n,ivar)/westbintotal(ivar)))**2
            enddo
            L2boot(ivar)=sqrt(L2boot(ivar))
            JMboot(ivar)=sqrt(JMboot(ivar))

            if (L2boot(ivar) .gt. L2(ivar)) then
               pL(ivar)=pL(ivar)+1
            else
            endif

            if (JMboot(ivar) .gt. JM(ivar)) then
               pJ(ivar)=pJ(ivar)+1
            else
            endif
         enddo

      enddo


C ***** output results *****

      print *, 'output'

      do ivar=1,14
         write (6,*) ivar, ' L2: ', L2(ivar),
     +      ' pvalue: ', pL(ivar)/numiter
         write (6,*) ivar, ' JM: ', JM(ivar),
     +      ' pvalue: ', pJ(ivar)/numiter
         write (6,*)
      enddo

      end